2020-APRIL-27

Generalised linear models

Objectives

  1. logistic regression models, which apply to binary responses;
  2. Poisson regression models, which apply to counts and rates;
  3. over-dispersed Poisson (negative binomial) models;
  4. zero-inflated Poisson models, in which count data contain an over-abundance of zeros.

Introduction

Our regression workflow is built on two imperatives:

  1. to clarify our assumptions;
  2. to clarify our decisions.

Linear regression: Fit a line

\[ y_i \sim \operatorname{Normal}( \mu_i, \sigma)\\ \sigma \sim \operatorname{Exponential}(1) \\ \mu_i = \alpha + \beta x_i \\ \boldsymbol{\eta} = \boldsymbol{\alpha} + \boldsymbol{X}\boldsymbol{\beta}\\ \]

Generalised linear model implies a ‘link’ function
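In R, the pairing of a response distribution with a link function is exactly what the `family` argument of `glm()` encodes. A minimal sketch on simulated data (all variable names hypothetical):

```r
# A GLM couples (1) a response distribution with (2) a link function
# connecting the mean of the response to the linear predictor.
# In R's glm(), both are supplied through the `family` argument.

# toy binary data, for illustration only
set.seed(123)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(0.5 + 1.0 * x))

fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)  # estimates are on the link (log-odds) scale
```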

Logistic regression

Link function for logistic regression

The binomial link function does two things: it maps probabilities, which live in \((0, 1)\), onto the whole real line, and it maps linear predictors back from the real line into \((0, 1)\).

The logit function satisfies the first task:

\(\operatorname{logit}(p) = \log_e\frac{p}{1-p}\)

The inverse logit function satisfies the second task:

\(\operatorname{logit}^{-1}(x) = \frac{e^x}{1+e^x}\)

We write this

\[ \Pr(y_i = 1) = p_i \\ logit(p_i) = \alpha + \beta x_i \\ \Pr(y_i = 1) = logit^{-1}(\alpha + \beta x_i) \]
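In R these two functions are available as `qlogis()` (logit) and `plogis()` (inverse logit); a quick check that they match the definitions and undo one another:

```r
# logit maps probabilities (0,1) to the real line;
# inverse logit maps the real line back into (0,1).
p      <- 0.8
eta    <- qlogis(p)       # logit: log(p / (1 - p))
p_back <- plogis(eta)     # inverse logit: exp(eta) / (1 + exp(eta))

# check against the explicit formulas
logit_manual    <- log(p / (1 - p))
invlogit_manual <- exp(eta) / (1 + exp(eta))
```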

Logistic regression model

Result

## Parameter   | Log-Odds |   SE |       95% CI |     z |      p
## -------------------------------------------------------------
## (Intercept) |     1.46 | 0.05 | [1.36, 1.55] | 30.38 | < .001

Equation form

\[ \log\left[ \frac { \widehat{P( \operatorname{HomeOwner} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{HomeOwner} = \operatorname{1} )} } \right] = 1.46 \]

Interpretation using plogis

## [1] 0.811068
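The value above can be reproduced directly from the printed intercept; any small discrepancy comes from the table rounding the intercept to 1.46:

```r
# Convert the log-odds intercept to a probability: the estimated
# proportion of home owners in the sample.
plogis(1.46)  # ~0.81, matching the output above up to rounding
```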

Logistic regression with a single covariate

Workflow: center and scale continuous predictors
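A minimal sketch of the centering-and-scaling step (toy income values, hypothetical):

```r
# Standardise a continuous predictor: subtract the mean, divide by
# the SD. The intercept then refers to a person at the mean income,
# and a one-unit change is one standard deviation.
nz <- data.frame(Household.INC = c(20000, 45000, 70000, 120000))  # toy values
nz$Household.INC_s <- as.numeric(scale(nz$Household.INC))
mean(nz$Household.INC_s)  # ~0
sd(nz$Household.INC_s)    # 1
```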

Model syntax

Results

## Parameter       | Log-Odds |   SE |       95% CI |     z |      p
## -----------------------------------------------------------------
## (Intercept)     |     1.58 | 0.05 | [1.48, 1.69] | 29.37 | < .001
## Household.INC_s |     0.72 | 0.08 | [0.57, 0.89] |  8.86 | < .001

Equation form

\[ \log\left[ \frac { \widehat{P( \operatorname{HomeOwner} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{HomeOwner} = \operatorname{1} )} } \right] = 1.58 + 0.72(\operatorname{Household.INC\_s}) \]

Interpretation not clear

## We fitted a logistic model (estimated using ML) to predict HomeOwner with Household.INC_s (formula: HomeOwner ~ Household.INC_s). The model's explanatory power is weak (Tjur's R2 = 0.04). The model's intercept, corresponding to Household.INC_s = 0, is at 1.58 (95% CI [1.48, 1.69], p < .001). Within this model:
## 
##   - The effect of Household.INC_s is significantly positive (beta = 0.72, 95% CI [0.57, 0.89], p < .001; Std. beta = 0.73, 95% CI [0.57, 0.89])
## 
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
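One way to make the log-odds estimates interpretable is to push them back through the inverse logit; a sketch using the rounded estimates above:

```r
# Predicted probability of home ownership at the mean income versus
# one standard deviation above it, from the rounded estimates above.
b0 <- 1.58   # intercept (log-odds at mean income)
b1 <- 0.72   # log-odds change per SD of income
plogis(b0)        # at mean income: ~0.83
plogis(b0 + b1)   # at +1 SD of income: ~0.91
```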

Graph your results!

Alternative graph

Spread of income

Range of income

The range of incomes is 0 to \(2.5 \times 10^{6}\).

Sensitivity to outliers?

## [1] 0.979438

Alternative model

## Parameter       | Log-Odds |   SE |       95% CI |     z |      p
## -----------------------------------------------------------------
## (Intercept)     |     1.60 | 0.05 | [1.50, 1.71] | 29.21 | < .001
## Household.INC_s |     0.78 | 0.08 | [0.62, 0.95] |  9.27 | < .001

Range of predictors

Model with extreme values removed

Aside: a trick

scale_x_continuous(limits = c(-1.2, 4))

This code allowed me to constrain both graphs to the same x-axis scale. If I left this out the graphs would have looked like this:

What if we used lm?

## Parameter       | Coefficient |       SE |       95% CI | t(2796) |      p
## --------------------------------------------------------------------------
## (Intercept)     |        0.81 | 7.27e-03 | [0.80, 0.83] |  111.88 | < .001
## Household.INC_s |        0.06 | 7.23e-03 | [0.04, 0.07] |    8.04 | < .001
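One reason to prefer `glm()` here: `lm()` fits a "linear probability model" to the 0/1 outcome, and nothing constrains its fitted values to the interval [0, 1]. A deterministic toy illustration (data hypothetical):

```r
# A linear probability model (lm on a 0/1 outcome) can return
# fitted "probabilities" outside [0, 1].
x <- seq(-3, 3, length.out = 200)
y <- as.numeric(x > -1)   # binary outcome, toy rule
lpm <- lm(y ~ x)
max(fitted(lpm))          # exceeds 1 at the upper end of x
```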

Compare graphs

Logistic regression with a categorical covariate

## Parameter                               | Log-Odds |   SE |         95% CI |      z |      p
## --------------------------------------------------------------------------------------------
## (Intercept)                             |     2.13 | 0.10 | [ 1.94,  2.34] |  20.68 | < .001
## GenCohortGen_Silent: born < 1946        |     0.26 | 0.28 | [-0.26,  0.85] |   0.94 | 0.347 
## GenCohortGenX: born >= 1961 & b. < 1980 |    -0.62 | 0.13 | [-0.87, -0.38] |  -4.90 | < .001
## GenCohortGenY: born >= 1980 & b. < 1996 |    -1.87 | 0.14 | [-2.15, -1.59] | -13.00 | < .001
## GenCohortGenZ: born >= 1996             |    -4.91 | 1.04 | [-7.80, -3.30] |  -4.74 | < .001

Interpretation

Stratify by income

## Parameter                               | Log-Odds |   SE |         95% CI |      z |      p
## --------------------------------------------------------------------------------------------
## (Intercept)                             |     2.60 | 0.12 | [ 2.37,  2.84] |  21.79 | < .001
## GenCohortGen_Silent: born < 1946        |     0.70 | 0.30 | [ 0.15,  1.33] |   2.35 | 0.019 
## GenCohortGenX: born >= 1961 & b. < 1980 |    -0.99 | 0.14 | [-1.26, -0.73] |  -7.31 | < .001
## GenCohortGenY: born >= 1980 & b. < 1996 |    -2.40 | 0.16 | [-2.71, -2.09] | -14.94 | < .001
## GenCohortGenZ: born >= 1996             |    -4.99 | 1.06 | [-7.91, -3.32] |  -4.72 | < .001
## Household.INC_s                         |     1.19 | 0.10 | [ 0.99,  1.39] |  11.62 | < .001

Results

Add data to graph

Counts in cells

## 
## Gen Boombers: born >= 1946 & b.< 1961                Gen_Silent: born< 1946 
##                                  1024                                   208 
##          GenX: born >=1961 & b.< 1980          GenY: born >=1980 & b.< 1996 
##                                  1253                                   416 
##                    GenZ: born >= 1996 
##                                    17

Graph

Model selection

## # Comparison of Model Performance Indices
## 
## Name  | Model |      AIC |      BIC | Tjur's R2 |  RMSE | Sigma | Log_loss | Score_log |   PCP | Score_spherical
## ----------------------------------------------------------------------------------------------------------------
## home2 |   glm | 2590.320 | 2602.193 |     0.036 | 0.381 | 0.962 |    0.462 |      -Inf | 0.708 |                
## mg1   |   glm | 2516.922 | 2546.675 |     0.099 | 0.372 | 0.941 |    0.442 |      -Inf | 0.724 |       3.915e-04
## mg2   |   glm | 2272.307 | 2307.927 |     0.168 | 0.354 | 0.900 |    0.404 |      -Inf | 0.748 |       5.159e-04

Graph

Model accuracy: model 1

## # Accuracy of Model Predictions
## 
## Accuracy: 66.39%
##       SE: 2.72%-points
##   Method: Area under Curve

Model accuracy: model 2

## # Accuracy of Model Predictions
## 
## Accuracy: 77.02%
##       SE: 2.97%-points
##   Method: Area under Curve

Poisson regression (counts)

Link function for Poisson regression

Example

Model syntax

## Parameter   | Log-Mean |   SE |       95% CI |     z |      p
## -------------------------------------------------------------
## (Intercept) |     1.01 | 0.11 | [0.79, 1.21] |  9.40 | < .001
## x           |     1.99 | 0.08 | [1.84, 2.14] | 26.24 | < .001

Results

\[ \log ({ \widehat{E( \operatorname{y} )} }) = 1.01 + 1.99(\operatorname{x}) \]
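Because the Poisson link is the log, coefficients act multiplicatively once exponentiated; using the rounded estimates above:

```r
# Poisson coefficients live on the log scale; exponentiating converts
# them to an expected count and a rate ratio (rounded estimates from
# the table above).
b0 <- 1.01
b1 <- 1.99
exp(b0)   # expected count at x = 0, about 2.75
exp(b1)   # rate ratio: each unit of x multiplies the count ~7.3-fold
```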

Graph

Compare with normal lm

# model
pois2 <- glm(y ~ x, data = fake)  # `family = poisson` removed: glm() now fits a gaussian (ordinary linear) model
# graph 
p_pois2 <- plot(ggeffects::ggpredict(pois2, terms = "x"),
                add.data = TRUE)
# render
p_pois2

Splines

Graph all three

Compare performance

Negative binomial models (over-dispersed Poissons)

Link function for a negative binomial model

\[ y_i \sim \operatorname{NegBinomial}(\lambda_i, \phi)\\ \log(\lambda_i) = \alpha + \beta x_i\\ \phi \sim \operatorname{Gamma}(0.01, 0.01) \]

Example data

Fit model: Poisson

## # Overdispersion test
## 
##        dispersion ratio =   5.068
##   Pearson's Chi-Squared = 496.695
##                 p-value = < 0.001

Model syntax: neg.bin

Graph Poisson

Graph negative binomial

Linear model?

Zero-inflated Poisson / negative binomial regression

Link function for zero-inflated Poisson

Example: hours volunteering

Proportion of non-volunteers

## [1] 0.6826594

Crude estimate of overdispersion

## [1] 1.680429

There is about 1.68 times more dispersion than a Poisson model would expect.
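The crude dispersion ratio is the Pearson chi-square divided by the residual degrees of freedom; a sketch on simulated over-dispersed counts (all names hypothetical):

```r
# Crude overdispersion check for a Poisson fit: the sum of squared
# Pearson residuals over the residual degrees of freedom should be
# near 1; values well above 1 suggest overdispersion.
set.seed(42)
x <- rnorm(300)
y <- rnbinom(300, mu = exp(1 + 0.5 * x), size = 1)  # over-dispersed counts
fit <- glm(y ~ x, family = poisson)
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
dispersion  # substantially greater than 1 here
```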

Formal zero-inflation check

## # Check for zero-inflation
## 
##    Observed zeros: 1992
##   Predicted zeros: 536
##             Ratio: 0.27

Formal overdispersion check

## # Overdispersion test
## 
##        dispersion ratio =    13.196
##   Pearson's Chi-Squared = 37595.179
##                 p-value =   < 0.001

Model syntax (brms package)
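A sketch of the brms call behind the output below; the object name and the `nz` data frame are assumptions, and the `brm()` fit itself is left commented out because MCMC sampling is slow:

```r
# Formula and family for the zero-inflated Poisson model reported
# below (object name b_zip is hypothetical).
zip_formula <- HoursCharity ~ Relid_s + Household.INC_s
# library(brms)
# b_zip <- brm(zip_formula, family = zero_inflated_poisson(), data = nz)
```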

Results

##  Family: zero_inflated_poisson 
##   Links: mu = log; zi = identity 
## Formula: HoursCharity ~ Relid_s + Household.INC_s 
##    Data: nz (Number of observations: 2796) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept           1.70      0.02     1.67     1.73 1.00     3785     3068
## Relid_s             0.00      0.01    -0.02     0.03 1.00     4501     3362
## Household.INC_s    -0.09      0.02    -0.12    -0.05 1.00     3704     3054
## 
## Family Specific Parameters: 
##    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## zi     0.70      0.01     0.68     0.72 1.00     3849     2989
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Predicted effects of religion

Predicted effects of income:

Negative binomial zero-inflated model

Results

##  Family: zero_inflated_negbinomial 
##   Links: mu = log; shape = identity; zi = identity 
## Formula: HoursCharity ~ Relid_s + Household.INC_s 
##    Data: nz (Number of observations: 2796) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept           0.89      0.17     0.52     1.17 1.00      949      885
## Relid_s             0.21      0.06     0.11     0.33 1.00     1456     1698
## Household.INC_s    -0.07      0.05    -0.16     0.03 1.00     2061     2040
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## shape     0.28      0.07     0.16     0.42 1.00      938      914
## zi        0.35      0.11     0.07     0.50 1.00      939      697
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Compare models

##    elpd_diff se_diff
## b1     0.0       0.0
## b0 -1514.8     207.5

Translation

We can translate the loo_compare output into the WAIC convention (which sits on the deviance scale, i.e. −2 × elpd) as:

##    waic_diff       se
## b1     0.000   0.0000
## b0  3029.657 414.9648
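The conversion is just a factor of −2, as this check against the b0 row shows (small discrepancies arise because the printed elpd values are rounded):

```r
# loo_compare() reports differences in elpd; WAIC = -2 * elpd, so
# differences translate by -2 and their standard errors by 2
# (values taken from the b0 row above).
elpd_diff <- -1514.8
se_diff   <- 207.5
waic_diff <- -2 * elpd_diff   # ~3029.6
waic_se   <- 2 * se_diff      # ~415.0
```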

What predicts the zeros?

##  Family: zero_inflated_negbinomial 
##   Links: mu = log; shape = identity; zi = identity 
## Formula: HoursCharity ~ Relid_s + Household.INC_s 
##    Data: nz (Number of observations: 2796) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept           0.89      0.17     0.52     1.17 1.00      949      885
## Relid_s             0.21      0.06     0.11     0.33 1.00     1456     1698
## Household.INC_s    -0.07      0.05    -0.16     0.03 1.00     2061     2040
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## shape     0.28      0.07     0.16     0.42 1.00      938      914
## zi        0.35      0.11     0.07     0.50 1.00      939      697
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Interpretation

The probability of non-volunteering in the preferred model, for people at the mean Religious Identification and mean Household income in this population, is plogis(.35), or 0.5866176. More often than not, we should predict zeros in this population. What predicts the zero component of the model? We can use this syntax:

ZI_NEGBIN Syntax
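A sketch of the brms syntax for giving the zero-inflation component its own predictors via `bf()`; the model name `b2` and the `nz` data frame are assumptions, and the slow `brm()` fit is left commented out:

```r
# Count part and zero-inflation part each get a formula; brms::bf()
# combines them (names here are assumptions matching the table below).
count_part <- HoursCharity ~ Relid_s + Household.INC_s
zi_part    <- zi ~ Relid_s + Household.INC_s
# library(brms)
# b2 <- brm(bf(count_part, zi_part),
#           family = zero_inflated_negbinomial(), data = nz)
```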

Results

HoursCharity

Predictors          | Incidence Rate Ratios | CI (95%)
------------------- | --------------------- | -----------
Count Model         |                       |
  Intercept         | 3.40                  | 2.85 – 3.95
  Relid_s           | 1.02                  | 0.94 – 1.11
  Household.INC_s   | 0.93                  | 0.84 – 1.03
Zero-Inflated Model |                       |
  Intercept         | 1.06                  | 0.72 – 1.38
  Relid_s           | 0.52                  | 0.42 – 0.61
  Household.INC_s   | 1.02                  | 0.87 – 1.17
------------------- | --------------------- | -----------
Observations        | 2796                  |
R2 Bayes            | 0.015                 |

Interpretation: relid

Interpretation: Household income

Better graphs


Model selection

##    elpd_diff se_diff
## b2     0.0       0.0
## b1   -50.0      10.1
## b0 -1564.8     206.1

Model Selection

##    waic_diff        se
## b2    0.0000   0.00000
## b1  100.0416  20.14189
## b0 3129.6983 412.13260

Posterior predictive checks: Poisson

Zero-inflated negative binomial with no predictors for the zero-inflation part.

Posterior predictive checks: zi-Poisson and zi-negbin

Posterior predictive checks: zinegbin

Appendix 1

If you want to graph each predictor separately, employ the `[[1]]` or `[[2]]` syntax as follows:

Predicted effects of religious identification

Aside

Plot

Acknowledgments and References

Gelman, Hill, and Vehtari (2020)

Kurz (2020)

Bürkner and Vuorre (2019)

Bibliography

Bürkner, Paul-Christian, and Matti Vuorre. 2019. “Ordinal Regression Models in Psychology: A Tutorial.” Advances in Methods and Practices in Psychological Science 2 (1): 77–101.

Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge University Press.

Kurz, A. Solomon. 2020. Zenodo. https://doi.org/10.5281/zenodo.4080013.